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Abstract: A basic assumption of molecular biology is that proteins sharing 
close three-dimensional (3D) structures are likely to share a common function 
and in most cases derive from a same ancestor. Computing the similarity be- 
tween two protein structures is therefore a crucial task and has been extensively 
investigated. Evaluating the similarity of two proteins can be done by finding 
an optimal one-to-one matching between their components, which is equivalent 
to identifying a maximum weighted clique in a specific "alignment graph" . 

In this paper we present a new integer programming formulation for solv- 
ing such clique problems. The model has been implemented using the ILOG 
CPLEX Callable Library. In addition, we designed a dedicated branch and 
bound algorithm for solving the maximum cardinality clique problem. Both 
approaches have been integrated in VAST (Vector Alignment Search Tool) - a 
software for aligning protein 3D structures largely used in NCBI (National Cen- 
ter for Biotechnology Information). The original VAST clique solver uses the 
well known Bron and Kerbosh algorithm (BK). Our computational results on 
real life protein alignment instances show that our branch and bound algorithm 
is up to 116 times faster than BK for the largest proteins. 

Key-words: Branch and bound, integer programming, maximum clique prob- 
lems, protein structure alignment. 
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Probleme de clique maximum 
pour la similarity des structures de proteines 

Resume : Une hypothese fondamentale en biologie moleculaire est que des 
proteines partageant des structures 3D similaires ont de fortes chances de par- 
tager une fonction commune, et dans la plupart des cas derivent d'un ancetre 
commun. Evaluer la similarite entre deux structures de proteines est done unc 
tache essentielle qui a ete largement etudiee. Comparer deux proteines revient 
a chercher un mariage stable optimal de leurs composants, ce qui est equivalent 
a rechercher une clique de poids maximum dans un "graphc d'alignement" 
specifique. 

Dans ce rapport, nous presentons un nouveau modele de programmation 
lineaire en nombres entiers pour resoudre ces problemes de cliques. Ce modele 
a ete implements en utilisant CPLEX 10. Nous avons egalement congu un 
algorithme de branch and bound dedie pour resoudre le probleme de clique 
de cardinalite maximum. Ces deux approches ont ete integrees dans VAST 
(Vector Alignement Search Tool) - un logicicl pour Paligncmcnt des structures 
3D de protines largement utilise au NCBI (National Center for Biotechnology 
Information). Pour resoudre ses problemes de cliques, VAST utilise l'algorithme 
de Bron et Kerbosh (BK). Les resultats experimentaux sur des instances reelles 
d'aligncmcnts dc proteines montrcnt que notre algorithme dc branch and bound 
est jusqu'a 116 fois plus rapide que BK pour les plus grandes proteines. 

Mots-cles : Branch and bound, programmation lineaire en nombres entiers, 
clique maximum, alignement de structures de proteines. 
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1 Introduction 

A protein is an ordered sequence of amino acids. Under specific physiological 
conditions, the linear arrangement of amino acids will fold and adopt a complex 
3D shape. This 3D shape contains some highly regular sub-structures called 
secondary structures elements (SSEs), such as ahelices and /3strands. 

A fruitful assumption of molecular biology is that proteins sharing close 
three-dimensional (3D) structures are likely to share a common function and in 
most cases derive from a same ancestor. Computing the similarity between two 
protein structures is therefore a crucial task and has been extensively investi- 
gated [1, 2, 3, 4]. Evaluating the similarity of two proteins Pi and Pi is usually 
done by evaluating the best - according to some criterion - one-to-one matching 
between their components (also called alignment). 

Among the various protein structure alignment methods, we are interested 
in VAST [5] (Vector Alignment Search Tool) - a software for aligning protein 
3D structures largely used in NCBI (National Center for Biotechnology Infor- 
mation) - where the problem of aligning two proteins is converted into finding 
a maximum clique in a specific graph. 

In VAST, two proteins P\ and Pi are represented by their ordered sets 
of SSEs (Vi and Vi respectively), and a structural alignment of Pi and P 2 is 
an order-preserving one-to-one matching between the SSEs of V\ and the ones 
of Vi. Such matchings are represented here by the so called alignment graph 
G = (V, E) . In our approach, the vertex set V is depicted by a grid, where each 
row corresponds to a SSE of Vi, while each column corresponds to a SSE of 
Vi. A vertex ik, situated on the intersection of row i £ V\ with column k 6 Vi 
exists (ik G V), iff the SSEs i and k are "compatible" (i.e. are of the same type 
and have similar lengths). An edge (ik,jl) between two vertices ik and jl exists 
((ik,jl) G E), iff the pair (ik,jl) is "compatible" (i.e. i < j and I < k (for order 
preserving) and if the SSEs couple i and j from V\ can be well superimposed 
in 3D-space to the SSEs couple k and I from Vi) 1 . To each vertex ik £ V we 
can associate a weight Sik £ R, and to each edge (ik, jl) G E we can associate 
a weight Cikji £ R. Finding a suitable secondary structure alignment is then 
equivalent to discovering a clique in the grid graph G. 

Looking for cliques in this kind of graphs arises in different situations, where 
matching in bipartite graphs preserving the order of vertices is sought. Such 
an example is another protein structure alignment method called Contact Map 
Overlap Maximization (CMO) [4]. 

Various clique related problems can be formulated in such grid graph. The 
Maximum Cardinality Clique problem MCC consists in finding in G a clique 
of maximum cardinality, denoted by MCC(G). MCC is one of the first problem 
shown to be NP-Complete [6]. The Maximum Vertex Weighted clique problem 
MVW consists in finding a clique with a maximum sum of vertex weights. 
If the vertex weights are all equal to 1, then MVW is equivalent to MCC. 
Since MCC is a special case of MVW, then MVW is also NP-Complete. The 
Maximum Edge Weighted clique problem MEW consists in finding a clique 
with a maximum sum of edge weights. Again, if the weights are all equal to 
1, then MEW is equivalent to MCC, so MEW is also NP-Complete. All these 
clique problems have been intensively investigated [7, 8, 9, 10, 11]. Moreover, 

1 See [5] for the exact definition of superimposing SSEs couples in 3D-space. 
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these three problems are all special cases of the Maximum Weighted Clique 
problem MWC, which consists in finding the clique having the maximum sum 
of vertex and edge weights. If the vertex weights are equal to zero, then MWC is 
equivalent to MEW, if edge weights are all equal to zero, then MWC is equivalent 
to MVW. If the vertex weights arc all equal to 1 and the edge weights are all 
equal to zero, then MWC is equivalent to MCC. 

In this article, we present a new mathematical programming model for solv- 
ing the most general clique problem MWC. The model has been implemented 
using ILOG CPLEX 10 Callable Library, and validated on the so called Skolnick 
set (widely used in protein structure comparison articles [1, 12, 13]). In addi- 
tion, we also design a dedicated branch and bound algorithm (B&B) for MCC. 
The B&B solver has been integrated in VAST and compared to the original 
VAST clique solver BK (the Bron and Kerbosh algorithm [7]) on large real life 
protein structures. The obtained results show that our B&B algorithm is up to 
116 times faster than BK for the largest proteins. 



2 Mathematical programming model for MWC 

The use of mathematical programming is not new in the field of maximum 
cliques [14, 15]. However, by using the properties of our graphs, we designed a 
new linear mathematical programming model for solving the maximum weighted 
clique problem (and thus solving MCC, MVW and MEW) on a grid G = (V, E), 
where the weights associated to the vertices and edges are all in M. 

To each vertex ik G V (in row i G V\ and column k 6 V2), we associate a 
binary variable Xik such that : 

J 1 if vertex ik is in the clique, 
Xlk _ \ otherwise. 



In the same way, to each edge (ik,jl) G E, we associate a binary variable yikji 
such that : 

J 1 if edge (ik,jl) is in the clique, 
Vikil ~ 1 otherwise. 



The goal is to find a clique which maximizes the sum of its vertex weights 
and the sum of its edge weights. This leads to the objective function : 

Zmwc = max ^ S ik Xik + ^ ] Cikjl Vikjl- (1) 
ik (ik.jl) 

The one-to-one matching implies that in each row i G V\ , at most one vertex 
can be chosen (only one Xik can be set to 1). 

5> jfc <i, Vie^i. (2) 
k 

The same holds for the columns. 

5> ifc <i, Vfcey 2 . (3) 

i 

These special order set constraints (maximum one activated vertex per row and 
per column) lead to compact formulations of the relations between vertices and 
edges. 
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Denote by d^ ol (ik) the set of columns with indices I greater than k and such 
that there exists at least one edge outgoing from the vertex ik and heading to a 
vertex in column I. In a similar way we introduce the notations d~ ol (ik) , d+ ow (ik) 
and d~ ow (ik). More precisely, d~ ol (ik) is the set of columns with indices / smaller 
than k and such that there exists at least one edge heading to the vertex ik and 
coming from a vertex in column I. d+ ow (ik) is the set of rows with indices j 
greater than i and such that there exists at least one edge outgoing from ik and 
heading to a vertex in row j. And finally, d~ ow (ik) is the set of rows with indices 
j smaller than i and such that there exists at least one edge heading to ik and 
coming from a vertex in row j. 

Edges driven activations of vertices can be formulated with the following 
compact inequalities : 

x ik > ^VikjU Vik G V, VI G d+Jik). (4) 

j 

xji > Vikji, Vjl e^Vfce d- ol (jl). (5) 

i 

x ik > J^Vikji, Vifc g V, Vj g d+ ow (ik). (6) 

Xji > Vikju Vj7 G V, Vi e d~ ow (jl). (7) 
k 

Vertices driven activations of edges can be formulated with the following 
compact inequalities : 

Y^ Xik+ Yj x 3 i ~ H y^ 1 - Vfc e V2 ' vi<eV ^ k<L ( 8 ) 

i j ij 

x ik + x i l - H V^ 1 < !. V* € Vi, Vj G V u i < j. (9) 
k i ki 



Remarks: 

Our model is designed to perform on alignment grids, and thus takes advantages 
of characteristics that grid alike graphs do not to share with randomly generated 
graphs. The properties "one-to-one matching" and "order preserving" lead to 
the creation of special order sets. In the mathematical model this is illustrated 
by constraints (2) and (3). Moreover, this leads to the implication u (ik,jl) E E 
implies i < j and k < V . As a consequence, if ik is in a clique C, all vertices jl 
such that j > i and I < k, or j < i and I > k, cannot be in the clique C. 



3 Branch and Bound approach for MCC 

We present here a new branch and bound algorithm (B&B) for solving the MCC 
problem in the previously defined grid G = (V, E), where the vertices in V are 
labeled by their coordinates ik, i for the row number and k for the column 
number. 

Definition 1 A successor of a vertex ik G G is an element of the setT^,{ik) = 
{jl G V s.t. (ik,jl) £ E,i < j and k < I}. 
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Definition 2 A predecessor of a vertex ik £ G is an element of the set 
Tq(Hc) = {jl £ V s.t. (jl,ik) £ E,j < i and I < k}. 

We also use the following notations: Ag(ifc) denotes the subgraph of G 
induced by the vertices in T^ik); A^(ik) denotes the subgraph of G induced 
by the vertices in T^(ik); G s denotes the subgraph of G induced by the set 
of vertices S c V. 

3.1 Branch and Bound rules 

3.1.1 Branching: 

Each node of the B&B is characterized by a couple (C, Cand) where C is the 
clique under construction and Cand is the set of candidate vertices to be added 
to C. All B&B nodes can also access to Cbest, the best clique found so far during 
the exploration of the B&B tree. At the root of the B&B tree, these arguments 
are initially set to C = 0, Cand = V and Cbest = 0- 

For a given B&B node N = (C\ Cand), the vertices in Cand will be visited 
according to their lexicographic increasing order (row first). We call NEXT {Cand) 
a function returning the vertex ik in Cand having the smallest lexicographic 
index. Denote a direct descendant of N by N' — (C ,Cand'). The first de- 
scendant is created by a recursive call with arguments C = C + {ik} and 
Cand' = Candf\T~Q(ik). This corresponds to exploring deeper the tree staying 
on the same branch and is realized by the step 7 of algorithm 1. Visiting the 
next direct descendent of N is done by a recursive call with arguments C' = C 
and Cand' — Cand \ {ik}. This corresponds to exploring a neighboring branch 
of the B&B tree (a wider move) and is realized by the step 8 of algorithm 1. 

3.1.2 Fathoming: 

A leaf in the B&B tree is interesting only if it contains a clique with a cardinality 
strictly greater than the cardinality of Cbest- Being given a B&B node (C, 
Cand), the current best clique Cbest, and a candidate vertex ik £ Cand, denote 
by MCCik(G) the maximum cardinality clique in G containing the vertex ik. 
If \MCCik(G)\ < \Cbest\, then we do not miss the solution by discarding ik 
from Cand (removing a vertex ik from Cand fathoms all the leafs of the current 
branch leading to a clique containing ik). In our approach we do not compute 
\MCCik(G)\ but we use some upper bounds (see section 3.2). 

Denote by dk the best clique that is found by branching on the vertex ik, 
and by MCC l k{G Cand ) the maximum cardinality clique in G Cand containing ik. 
It is easily seen that \C ik \ = \C\ + \MCC ik (G Cand )\. Any vertex ik £ Cand such 
that \MCC 2k (G Cand )\ < \C b est \ - \C\ leads to non-interesting leafs, and thus, 
can be removed from Cand. Again, we are going to use some upper bounds of 
\MCC ik {G Cand )\. 

3.1.3 Main algorithm: 

Denote by REMOV E{Cand,C, Cbest) a procedure which removes from Cand 
all vertices ik such that : (i) \MCC lk {G)\ < \C best \, or : (ii) \MCC lk {G Cand )\ < 
\Cbest\-\C\, according to some upper bounds of \MCC ik (G)\ and \MCC lk (G Cand )\. 
Algorithm 1 gives the global procedure MCC_BB for solving MCC{G). 
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Algorithm 1 MCC.BB(C, Cand, C best ) # initially called with C = 0, Cand = 

V and C best = 

l: if \C\ > \C bes t\ then 

2: Cbest <— C # Recording the new best clique. 
3: end if 

4: REMOVE (Cand, C, \Cb es t\) 

5: if Cand ^ then 

6: ik <- NEXT{Cand) 

7: MCC_BB(C + {ifc},Ccmd f| r£(ifc), Cb es t) # exploring deeper the ifc 
branch 

8: MCC_BB(C, Cand \ {ifc},Cb e st) # visiting another candidate (a wider 

move) 
9: end if 
10: return 



3.2 Maximum cardinality estimator 

The efficiency of the MCC_BB algorithm greatly depends on the ability of the 
procedure REMOVE to fathom non-interesting vertices from Cand. In order to 
do this, we need to tightly estimate \MCC lk {G)\ and \MCC ik {G Cand )\. These 
bounds are based on what we will call feasible paths in a grid. 

Definition 3 A feasible path in a grid G — {V, E} is an ordered sequence 
"i\ki, • • ■; itkt " of vertices E V, such that Vn G [1, t—1], (i n k n , i n+ \k n+ \) E 
E and i n < i n +i, k n < k n+1 . 

Denote by P(G) the longest (in terms of vertices) feasible path in G. Note that 
computing P(G) can be done by dynamic programming in 0(|-E|) time. 

3.2.1 Estimation of | MCC ik (G) | : 

For any vertex ik E V, we denote by Pi k {G) the longest feasible path in G con- 
taining ik, such that for any vertex jl ^ ik in the feasible path, jl is connected 
to ik (i.e. (ik,jl) E E or (jl,ik) E E). As illustrated in figure 1, Pi k (G) = 
P(A G (ik)) \J{ik} U P(A+(ifc)), and \P ik (G)\ - \P(A G (ik))\ + 1 + \P(A+(ik))\. 
It is easily seen that : 

\MCC lk (G)\ < \P lk (G)\,Vik E V. (10) 

Thus, any vertex ik E Cand such that \Pi k (G)\ < \Cb es t \ can be safely removed 
from Cand. Note that computing all \P ik (G)\ can be done once-for-all in a 
preprocessing step in 0(|V| x \E\) time. 

3.2.2 Estimation of \MCC lk (G Cand )\ : 

It is obvious that \MCC(G Cand )\ < \Cand\, so if |Cand| < |G 6est | - \C\ we can 
safely fathom all vertices from Cand. 

In the same way as we estimated \MCCi k (G)\, it is easily seen that : 

\MCC lk (G Cand )\ < \P lk (G Cand )\,\fik E Cand. (11) 



RR n° 6818 



8 



N. Malod-Dognin & R. Andonov & N. Yanev 




Figure 1: A graphical view of P ik {G). In this example, the longest feasible path 
in A^(ifc) contains 8 vertices, the longest feasible path in Ag(ifc) contains 3 
vertices, so the longest feasible path in G such that any vertex is connected to 
ik contains 12 vertices (including ik). 




Figure 2: A graphical view of Pik(G Cand ). In this example, by fixing the clique 
C, we created the subgraph Q Cand of the candidate vertices to be added to C. In 
this subgraph, for each candidate vertex ik, Pik(G Cand ) is found by computing 
the longest feasible paths in A GCand (ik) and in A GCand (ik). 



Any vertex ik G Cand such that \P ik {G Cand )\ < \C best \ - \C\ can be safely 
removed from Cand. In the subgraph G Cand — (Cand, Ec and), computing all 
|-Pife(G ;C ' a " <i )| can be done in 0(\Cand\ x \Ec a nd\) (see figure 2 for a graphical 
view of p k (G Cand )). 

Using the abovementioned estimators, we obtain the fathoming procedure 
REMOVE described in algorithm 2. 
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Algorithm 2 REMOVE {Cand, C, C best ) 

1: for ik G Cand do 

2: if P lk {G) < \C bes t \ then 

3: Cand <— Cand \ {ik}. 

4: end if 

5: end for 

6: if \Cand\ < \C best \ - \C\ then 

7: Cand <- {0} 

8: end if 

9: for ik G Cand do 

10: if P lfc (G Ca " d )<|C 6est |-|C|then 

11: Cand <— Cand \ {ik}. 

12: end if 

13: end for 

14: return 



4 Results 

All results presented here were obtained on the same desktop PC with one Intel 
Pentium A tm CPU at 3Ghz and 2GB of RAM. The mathematical programming 
based solver (MIP), was implemented in C using Hog Cplex 10.0 Callable Li- 
brary. The B&B solver, was also implemented in C. This two clique solvers were 
compared to the original VAST solver which is based on Bron and Kerbosh al- 
gorithm (BK) [7]. All algorithms were used to solve maximum cardinality clique 
problems. Note that in fact BK computes and evaluates all maximal cliques in 
a graph, and hence, can be used to solve any kind of clique problems. 

The comparison of the above algorithms was performed on real- life proteins. 
We used two different benchmarks which significantly differ by the size of the 
instances (number of SSEs). The first benchmark is the Skolnick set which was 
recently largely used in protein structure comparison papers [1, 12, 13]). This set 
contains 40 small protein chains (containing one domain), with a SSEs number 
varying from 5 to 20. The second set benchmark (S2) contains 36 long protein 
chains (containing from 4 to 6 domains), with a number of SSEs varying from 
51 to 87. Note that for the skolnick set, we only considered instances leading to 
an aligncmcnt graph with at least 100 vertices. 

The characteristics of the corresponding alignment grids are given in table 
1. One peculiarity is their low density, less than 20% for the Skolnick set, and 
less than 6% for S2 set. 



Set name 


Nur 
Min 


nber of vert 
Average 


ices 
Max 


IN 

Min 


umber of edf 
Average 


;es 

Max 


Min 


Density 
Average 


Max 


Skolnick 


100 


158.92 


208 


886 


2368.69 


3547 


0.16 


0.18 


0.20 


S2 


1390 


2384.97 


5582 


45278 


144206.44 


604793 


0.03 


0.05 


0.06 



Table 1: Characteristics of the grid graphs corresponding to the considered 
benchmarks. 



Figure 3 compares the time needed by MIP to the one of BK on the Skolnick 
set. On the 170 instances containing more than 100 vertices, MIP is always 
slower than BK (3.35 times slower in average). This is not surprising, since 
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dedicated solvers are expected to be faster than general purpose solvers (CPLEX 
in this case). However, this observation motivated us to go further in developing 
a fast special purpose clique solver. 

MIP vs BK running time comparison on Skolnick set 




0.16 



Figure 3: For each instance the execution time of MIP is plotted on the x-axis, 
while the one of BK is depicted on the y-axis. All points are above the x = y 
line (i.e. BK is always faster than MIP). 



Figure 4 compares the time needed by B&B to the one of BK on set S2. 
Here we observed that B&B is about 15.57 times faster than BK in average, 
and on the biggest instances (where both proteins contain more than 80 SSEs), 
it is up to 116.7 times faster. Such big instances are solved by B&B in less than 
79 seconds (25 sec. on average) while BK needs up to 2660 seconds (1521 sec. 
on average). These results are detailed in table 2. 



5 Conclusion 

We presented a new mathematical programming model for solving the maximum 
weighted clique problem arising in the context of protein structure comparison. 
This model was implemented and validated on a small benchmark. We also 
presented a new dedicated branch and bound algorithm for the maximum car- 
dinality clique problem. The computationnal results show that on big instances, 
our branch and bound is significantly faster than the Bron and Kerbosh algo- 
rithm (up to 116 times for the largest proteins). In the near futur, we intend to 
study the behavior of the proposed algorithms on arbitrary graphs, conveniently 
transformed into grid graphs in a preprocessing step. 
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B&B vs BK running time comparison on S2 set. 

Instances ' «~ 
y=x 

jr*:-' • 



10 20 30 40 50 60 70 80 

B&B time (sec.) 
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of BK is on the y-axis (in log scale). The x=y line is also given, and any point 
above it is an instance for which B&B is faster than BK. In average, B&B is 
15.57 times faster than BK. This superiority goes up to 116.7 times for the 
biggest instances. 
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\E\ 
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78 
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